Statistical thinking — exploratory data analysis
2024-09-19
Bar charts are a simple way of showing & comparing means or other values between groups of data
Top of the bar is drawn at the data value you wish to plot, bottom at 0
However, barcharts are wasteful & often don’t focus on the real information you want to convey
Dotplots show the same information but focus on the difference in the variable plotted
Common to show a measure of spread or variability in dotplots or barcharts
Standard errors of means or confidence intervals usually more useful than standard deviations
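A minimal sketch of the quantities mentioned here, using only the Python standard library. The data and the function name `mean_with_uncertainty` are hypothetical, and the interval uses a normal approximation (multiplier 1.96); for small samples a \(t\) quantile would give a somewhat wider interval.

```python
import math
import statistics

def mean_with_uncertainty(values, z=1.96):
    """Return the mean, SD, standard error, and an approximate 95% CI."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)      # sample SD (n - 1 denominator)
    se = sd / math.sqrt(n)             # standard error of the mean
    # Normal approximation (assumption); a t quantile is wider for small n
    ci = (mean - z * se, mean + z * se)
    return {"n": n, "mean": mean, "sd": sd, "se": se, "ci": ci}

# Hypothetical measurements for one group
group = [4.1, 5.3, 4.8, 5.9, 5.1, 4.6, 5.4, 5.0]
summary = mean_with_uncertainty(group)
print(summary)
```

The SD describes the spread of the observations; the standard error and interval describe uncertainty in the mean, which is usually what a comparison between groups is about.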
A boxplot is a useful summary of samples with \(n > 8\) observations
Shows the median, the upper and lower hinges & whiskers, plus any observations that lie beyond the whiskers
Useful to compare observations from 2 or more groups
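A sketch of the numbers a boxplot draws, under stated assumptions: hinges are taken as the medians of the lower and upper halves of the sorted data (sharing the median when \(n\) is odd), and whiskers reach the most extreme observations within 1.5 hinge-spreads of the hinges. Plotting software may use slightly different quartile definitions. The function name `boxplot_stats` and the sample are hypothetical.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Median, hinges, whisker ends, and points beyond the whiskers."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2                # halves share the median when n is odd
    lower_hinge = statistics.median(xs[:half])
    upper_hinge = statistics.median(xs[n - half:])
    spread = upper_hinge - lower_hinge
    lo_fence = lower_hinge - k * spread
    hi_fence = upper_hinge + k * spread
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return {
        "median": statistics.median(xs),
        "hinges": (lower_hinge, upper_hinge),
        "whiskers": (inside[0], inside[-1]),   # most extreme points inside the fences
        "outliers": [x for x in xs if x < lo_fence or x > hi_fence],
    }

# Hypothetical sample with one unusually large value
sample = [2, 3, 4, 5, 6, 7, 8, 9, 30]
box = boxplot_stats(sample)
print(box)
```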
Histograms are a graphical representation of the frequency or density distribution of data
Choosing the bin width is a critical step in drawing a histogram
\(\hat{\sigma}\) is the estimated SD, \(n\) number of observations, & \(\mathrm{IQR}\) the interquartile range
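The bin-width formulae themselves are not shown in the text. Assuming they are the two standard rules that use these symbols, Scott's rule \(h = 3.49\,\hat{\sigma}\,n^{-1/3}\) and the Freedman-Diaconis rule \(h = 2\,\mathrm{IQR}\,n^{-1/3}\), they could be computed roughly as follows. Function names and data are hypothetical; the quartiles come from `statistics.quantiles` with its default method, so the IQR may differ slightly from other software.

```python
import math
import random
import statistics

def scott_width(values):
    """Scott's rule: 3.49 * SD * n^(-1/3)."""
    n = len(values)
    return 3.49 * statistics.stdev(values) * n ** (-1 / 3)

def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: 2 * IQR * n^(-1/3)."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) * n ** (-1 / 3)

def n_bins(values, width):
    """Number of bins of the given width needed to cover the data range."""
    return math.ceil((max(values) - min(values)) / width)

random.seed(1)
data = [random.gauss(10, 2) for _ in range(200)]   # simulated, for illustration
for name, rule in [("Scott", scott_width),
                   ("Freedman-Diaconis", freedman_diaconis_width)]:
    h = rule(data)
    print(f"{name}: width = {h:.2f}, bins = {n_bins(data, h)}")
```

The IQR-based rule is less affected by outliers and heavy tails than the SD-based rule.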
Quantile-quantile (QQ) plots are useful to determine if a sample is normally distributed
Plots quantiles of the data against those of a reference distribution. If the data are normally distributed, points should fall on the line through the upper & lower quartiles of both distributions
100 random draws from a \(t_3\) distribution — heavy tails compared to normal
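A sketch of the numbers behind a normal QQ plot: each sorted observation is paired with a theoretical normal quantile, and the reference line passes through the lower & upper quartiles of both. The plotting positions \((i - 0.5)/n\) are one common choice and an assumption here, as are the function names and the simulated sample.

```python
import random
import statistics

STD_NORMAL = statistics.NormalDist()

def normal_qq(values):
    """Pair each sorted observation with its theoretical normal quantile."""
    xs = sorted(values)
    n = len(xs)
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1)
    theoretical = [STD_NORMAL.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, xs))

def quartile_line(values):
    """Slope and intercept of the line through the quartiles of both distributions."""
    sq1, _, sq3 = statistics.quantiles(values, n=4)
    tq1, tq3 = STD_NORMAL.inv_cdf(0.25), STD_NORMAL.inv_cdf(0.75)
    slope = (sq3 - sq1) / (tq3 - tq1)
    return slope, sq1 - slope * tq1

random.seed(42)
sample = [random.gauss(50, 5) for _ in range(100)]   # simulated normal data
pairs = normal_qq(sample)
slope, intercept = quartile_line(sample)
print(f"reference line: y = {intercept:.1f} + {slope:.2f} x")
```

For a heavy-tailed sample, the points at both ends would bend away from this line: below it on the left, above it on the right.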
Thus far dealt only with univariate data displays. Other displays needed for bivariate and multivariate data
A scatterplot displays the relationship between two variables, \(x\) and \(y\) say
Each point on the plot represents the value of \(x_i\) and \(y_i\) for a single observation \(i\)
Important to plot data so you aren’t surprised when you model it
Violin plots can be thought of as a combination of a boxplot and a density plot
Boxplots and violin plots can be criticised because they don’t show the data
Raincloud plots are an alternative that does — the dots are binned like a histogram
Dotplots are a related graph that shows the data
Beeswarm plots show all the data in a compact display and avoid points overlapping
Linear least squares regression makes some strong assumptions about your data; often these don’t hold
All hope abandon ye who enter here?
Often a transformation of the data can make them follow the assumptions more closely — though there are better ways
Transformations also play an important role in EDA
Powers & roots are a useful set of transformations: \(x \rightarrow x^p\)
If \(p\) is positive we have a power transformation; if \(p\) is negative we have an inverse power transformation
If \(p\) is a fraction we have a root transformation
Highly skewed distributions are difficult to explore because most of the data is scrunched up at one end of the distribution
Descending the ladder of powers & roots towards \(\log(x)\) pulls in the right tail. Ascending the ladder does the opposite; it pulls in the left tail if the data are negatively skewed
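A small sketch of descending the ladder on right-skewed data, with \(\log(x)\) standing in for \(p = 0\). The data are simulated (lognormal), the function names are hypothetical, and skewness is measured with a simple moment-based estimate; the sketch assumes strictly positive values.

```python
import math
import random
import statistics

def power_transform(x, p):
    """Ladder of powers: x**p, with log(x) taking the place of p = 0."""
    if x <= 0:
        raise ValueError("this sketch assumes strictly positive data")
    return math.log(x) if p == 0 else x ** p

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

random.seed(7)
right_skewed = [random.lognormvariate(0, 1) for _ in range(500)]
for p in (2, 1, 0.5, 0):
    transformed = [power_transform(v, p) for v in right_skewed]
    print(f"p = {p}: skewness = {skewness(transformed):.2f}")
```

Moving from \(p = 2\) down to the log, the skewness should shrink towards zero as the right tail is pulled in.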
Transformation can help render many types of nonlinear relationships roughly linear
Clear that we are thinking about pairs of variables here; \(x\) and \(y\)
Need to choose whether to transform \(y\), \(x\), or both
New York air quality data & various transformations to linearise the bivariate relationship
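One rough way to compare the choices of transforming \(y\), \(x\), or both is to check how close to linear each version is, for example via the Pearson correlation. The sketch uses simulated power-law data, not the New York air quality data; the exponent, noise level, and function name are hypothetical. A scatterplot of each candidate remains the better check.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation: a rough measure of linear association."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(3)
x = [random.uniform(1, 100) for _ in range(200)]
# Hypothetical power-law relationship with multiplicative noise
y = [2.0 * xi ** 2.5 * math.exp(random.gauss(0, 0.2)) for xi in x]

log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]

candidates = {
    "y ~ x": pearson_r(x, y),
    "log(y) ~ x": pearson_r(x, log_y),
    "y ~ log(x)": pearson_r(log_x, y),
    "log(y) ~ log(x)": pearson_r(log_x, log_y),
}
for label, r in candidates.items():
    print(f"{label}: r = {r:.3f}")
```

For a power-law relationship like this one, taking logs of both variables gives the most nearly linear relationship; other forms of nonlinearity call for transforming only \(y\) or only \(x\).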